RCrawler101-201605 (Week2)

Leo Lu

(This material is modified from Mansun Kuo’s work)



2016-05-28

Slides
http://bit.ly/RC101-201605-W2
http://bit.ly/安裝R

How to use these slides

Download these slides

here (Right Click > Save As…)

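Course information

Website / Forum / Fan page / Broadcast / Shared notes

If you have any questions after class, feel free to ask on the forum.

About the license of this material

The intellectual property rights of this material belong to 木刻思股份有限公司.

If you find this material useful and would like to share it with friends or use it to teach a course, you are very welcome to write to course@agilearning.io to request permission to use it.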

This material is modified from Mansun Kuo’s work.
All rights reserved by Agilearning.

Outline

Install: Chrome Extension

Architecture of Crawlers

Recap: Observation Skills and HTTP Request for Connection

R Packages Required

Pipeline Coding

Crawler’s toolkits in R

misc

Install Packages

(if you haven’t installed those packages yet)

Installing devtools requires some system dependencies.

Installation guide from R basic class

## === install required packages ===
pkg_list <- c("magrittr", "httr", "rvest", "stringr", "data.table",
              "jsonlite", "RSQLite", "devtools")
pkg_new <- pkg_list[!(pkg_list %in% installed.packages()[,"Package"])]
if (length(pkg_new)) install.packages(pkg_new)
## xmlview is not in pkg_list (it lives on GitHub), so check for it separately
if (!requireNamespace("xmlview", quietly = TRUE)) {
    devtools::install_github("hrbrmstr/xmlview")
}
## data.table >= 1.9.7 is needed; packageVersion() compares version
## numbers properly, whereas comparing version strings is lexicographic
if (packageVersion("data.table") < "1.9.7") {
    install.packages("data.table", type = "source",
                     repos = "https://Rdatatable.github.io/data.table")
}
rm(pkg_new, pkg_list)

Let’s Rock with R!

Hello RStudio

RStudio Settings

Must-known keyboard shortcuts

All RStudio keyboard shortcuts

Description                                Windows & Linux   Mac
Attempt completion / Indent                Tab               Tab
Run current line/selection                 Ctrl+Enter        Cmd+Return
Comment/uncomment current line/selection   Ctrl+Shift+C      Cmd+Shift+C
Reindent lines                             Ctrl+I            Cmd+I
Insert pipe operator                       Ctrl+Shift+M      Cmd+Shift+M

R recap

How to get help

Working Environment

Check your working directory every time you start to work!

Using getwd/setwd to get/set your working directory.

RStudio will set working directory automatically when opening new files

If you use Projects, RStudio will change working directory for you automatically.
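
A minimal sketch of the getwd/setwd round trip. It switches to the session's temporary directory (an arbitrary choice for illustration) and then restores the original location.

```r
old_wd <- getwd()     # remember where we started
setwd(tempdir())      # switch to the session's temporary directory
new_wd <- getwd()     # now reports the temporary directory
setwd(old_wd)         # restore the original working directory
```

Saving the old directory first makes it easy to return to it, which matters when a script changes directory partway through.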

Basic Data Structure

Vector, Matrix, Array, List and Data frame are the most basic data structures in R. They can be arranged in a table by dimensionality and by whether their elements share a single type:

     Homogeneous     Heterogeneous
1d   Atomic vector   List
2d   Matrix          Data frame
nd   Array

(Atomic) Vector

v1 <- c(1:10)
v1
#>  [1]  1  2  3  4  5  6  7  8  9 10
is.vector(v1)
#> [1] TRUE
length(v1)
#> [1] 10
s1 <- 2
s1
#> [1] 2
is.vector(s1)
#> [1] TRUE
length(s1)
#> [1] 1

List

Lists are also vectors, but not atomic vectors. Lists are generic vectors, with (naturally) different semantics.

Elements in a list can be of any type, and each element can have any length.

Function str can help you investigate the structure of a nested list.

li <- list(a = 1:10, 
           b = c("apple", "banana"))
str(li)
#> List of 2
#>  $ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ b: chr [1:2] "apple" "banana"
li2 <- list(li = li, 
            c = matrix(1:4, nrow = 2))
str(li2)
#> List of 2
#>  $ li:List of 2
#>   ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
#>   ..$ b: chr [1:2] "apple" "banana"
#>  $ c : int [1:2, 1:2] 1 2 3 4

Visualising lists

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

Lists Subsetting

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a[1:2])
str(a[4])
y <- list("a", 1L, 1.5, TRUE)
str(y[[1]])
str(y[[4]])
a$a
a[["b"]]

Lists Subsetting (Pepper Shaker)
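
A short sketch of the idea behind the picture, assuming the usual reading of the analogy: `[` returns a smaller list (the shaker with fewer packets), while `[[` returns the element itself (the contents of one packet).

```r
a <- list(a = 1:3, b = "a string")
sub_list <- a["a"]      # single bracket: still a list, of length 1
element  <- a[["a"]]    # double bracket: the integer vector itself
class(sub_list)
#> [1] "list"
class(element)
#> [1] "integer"
element[2]              # index into the extracted vector
#> [1] 2
```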

Data Frame

A data frame is a two-dimensional data structure for table-like, heterogeneous data.

df <- data.frame(gender = c("male", "female", "female", "male"),
                 age = c(33, 18, 24, 26))
## Add new column in a data frame
df$city <- c("Taipei", "Taipei", "Hsinchu", "Taichung")
df
#>   gender age     city
#> 1   male  33   Taipei
#> 2 female  18   Taipei
#> 3 female  24  Hsinchu
#> 4   male  26 Taichung
str(df)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ gender: Factor w/ 2 levels "female","male": 2 1 1 2
#>  $ age   : num  33 18 24 26
#>  $ city  : chr  "Taipei" "Taipei" "Hsinchu" "Taichung"

Recap: data structure

All the data structures above are objects. They dispatch to different methods and are stored as different types internally.

object                                           type        class
c(1, 2.5, 3)                                     double      numeric
c("male", "female", "female", "male")            character   character
factor(c("male", "female", "female", "male"))    integer     factor
matrix(1:9, nrow = 3)                            integer     matrix
list(a = 1:10, b = c("apple", "banana"))         list        list
data.frame(a = 1, b = "z")                       list        data.frame

Function

To understand computations in R, two slogans are helpful:

  1. Everything that exists is an object.
  2. Everything that happens is a function call.

John Chambers


`+`
#> function (e1, e2)  .Primitive("+")
`<-`
#> .Primitive("<-")
`[`
#> .Primitive("[")
`c`
#> function (..., recursive = FALSE)  .Primitive("c")

Function in R

A typical function in R may look like:

f <- function(par1, par2, ...) {
    # Some magic happened
    return(sth)    # return something
}
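
As a concrete instance of the template above, here is a small sketch. The helper name `strip_reply` and its `marker` parameter are made up for illustration; note that `marker` is used as a regular expression.

```r
# Hypothetical helper: drop a leading "Re: " marker from post titles.
strip_reply <- function(titles, marker = "Re: ") {
  cleaned <- sub(paste0("^", marker), "", titles)  # remove the marker only at the start
  return(cleaned)                                  # explicit return value
}
strip_reply(c("Re: hello", "hello"))
#> [1] "hello" "hello"
```

The second parameter has a default value, so callers only supply it when they need a different marker.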

Control Flow

If

The basic structure of conditional execution in R is:

if (an expression returns TRUE or FALSE) {
    # do something
} else if (another expression returns TRUE or FALSE) {
    # do something
} else {
    # do something
}
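
A runnable version of the skeleton above, using a made-up status value to pick a branch:

```r
status <- 404L   # hypothetical HTTP status code, for illustration only
if (status == 200L) {
  msg <- "OK"
} else if (status >= 400L && status < 500L) {
  msg <- "client error"
} else {
  msg <- "something else"
}
msg
#> [1] "client error"
```

Each condition must evaluate to a single TRUE or FALSE, which is why `&&` is used rather than the vectorised `&`.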

for

Iterate items in R.

# iterate a character vector
for (i in c("a", "b")) {
    print(i)
}
#> [1] "a"
#> [1] "b"
# nested loop
m <- matrix(numeric(), nrow = 2, ncol = 2)
for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
        m[i, j] <- i * j
    }
}
m
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    2    4

tryCatch

tryCatch({
  result <- expr
  # If you want to use more than one 
  # R expression in the "try" part then you'll have to 
  # use curly brackets.
  # 'tryCatch()' will return the last evaluated expression 
  # in case the "try" part was completed successfully
}, warning = function(w) {
  message("Here's the original warning message:")
  message(w)
  # Choose a return value in case of warning
  return(NULL)
}, error = function(e) {
  message("Here's the original error message:")
  message(e)
  # Choose a return value in case of error
  return(NA)
}, finally = {
  message("Some other message at the end")
  # finally:
  # Here goes everything that should be executed at the end,
  # regardless of success or error.
  # Usually it's used for releasing resources.
  # If you want more than one expression to be executed, then you 
  # need to wrap them in curly brackets ({...}); otherwise you could
  # just have written 'finally=<expression>' 
})

https://stackoverflow.com/questions/12193779/how-to-write-trycatch-in-r/12195574#12195574
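
The template above can be exercised with a small sketch. `safe_log` is a hypothetical wrapper chosen because `log()` conveniently produces a normal result, a warning, or an error depending on its input.

```r
safe_log <- function(x) {
  tryCatch({
    log(x)
  }, warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    NA_real_                      # value returned when a warning occurs
  }, error = function(e) {
    message("error caught: ", conditionMessage(e))
    NA_real_                      # value returned when an error occurs
  }, finally = {
    message("done with input: ", format(x))   # runs in every case
  })
}

safe_log(10)    # normal path: returns log(10)
safe_log(-1)    # log(-1) signals a warning, so the warning handler returns NA
safe_log("a")   # a non-numeric argument signals an error, so the error handler returns NA
```

In a crawler the same pattern lets one failed request return a placeholder instead of stopping the whole run.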

magrittr

What is magrittr

  1. With %>%, the left-hand side (LHS) is piped in as the first argument of the function on the right-hand side (RHS).
  2. Use the dot, ., as a placeholder in an expression.


Example

library(magrittr)
iris %>% head(5)
mtcars %>%
  subset(hp > 100) %>% 
  .[c("mpg", "cyl", "hp")]

Workflow of Crawler Design

Architecture of Crawlers

Workflow

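  1. Find the page that holds the data, picture what the data should look like, and decide on the output format (schema).
  2. Inspect the page to find the request/response that carries the data, then work upward layer by layer, adding conditionals and loops to automate the crawler.
  3. Parse the retrieved data.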
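
The three steps can be sketched as a loop, shown here without any network access. `fetch_one` is a hypothetical stand-in for a real request-and-parse step, and the two-column schema (page, title) is an assumption made for illustration.

```r
# Hypothetical stand-in for "request one page and parse it into rows".
fetch_one <- function(page) {
  if (page == 2) stop("simulated network failure")
  data.frame(page = page,
             title = paste("post on page", page),
             stringsAsFactors = FALSE)
}

pages <- 1:3
results <- lapply(pages, function(p) {
  tryCatch(fetch_one(p),
           error = function(e) {
             message("page ", p, " failed: ", conditionMessage(e))
             NULL                      # skip this page and keep going
           })
})
results <- Filter(Negate(is.null), results)  # drop the pages that failed
posts <- do.call(rbind, results)             # one data frame in the target schema
```

Deciding the schema first means every page yields rows of the same shape, so they can be bound together at the end.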

Crawler’s toolkits in R



Web Connector in R

HTTP request

A valid HTTP request includes four things: a URL, a method (such as GET or POST), headers, and a body.

httr

Connection: GET Method

Use GET() to request data from a specific resource

Starter template

## Not Run
library(httr)
res <- GET(
  url = "http://httpbin.org/get",
  add_headers(a = 1, b = 2),
  set_cookies(c = 1, d = 2),
  query = list(q="hihi")
)
content(res, as = "text", encoding = "UTF-8")
content(res, as = "parsed", encoding = "UTF-8")

Learn your first crawler from one example

library(magrittr)
library(httr)
library(rvest)
#> Loading required package: xml2

## Connection
url <- "https://www.ptt.cc/bbs/Gossiping/index.html"
res <- GET(url, 
           set_cookies(over18="1"))  # over18 cookie

## (Try get post titles)
res %>% 
  content(as = "text", encoding = "UTF-8") %>% 
  `Encoding<-`("UTF-8") %>% 
  read_html %>% 
  html_nodes(css = ".title a") %>% 
  html_text()
#>  [1] "[問卦] 首爾=彥州!?"                                         
#>  [2] "[問卦] Re: Fw: [爆卦] 解放軍少將辛旗承認課綱微調是他主導的!"
#>  [3] "Re: [問卦] 有沒有南加州大學知名校友的八卦?"                 
#>  [4] "[問卦] 陳建州 VS 陳彥州 誰贏?"                              
#>  [5] "[問卦] 8+9宮廟今天有什麼活動"                               
#>  [6] "Re: [問卦] 有沒有麥當當每年都會出現可樂杯的八卦"            
#>  [7] "Re: [新聞] 立院三讀《刑法》配套中 國民黨團突襲打"           
#>  [8] "[問卦] 請問有沒有諺文的八卦"                                
#>  [9] "[問卦] 有沒有超市的生鮮蔬菜的八卦?"                        
#> [10] "[問卦] 有沒有神被打落凡間的故事?"                          
#> [11] "Re: [新聞] 藍綠惡鬥 陸歸派不敢講真話"                       
#> [12] "Re: [問卦] 鼻塞怎麼辦"                                      
#> [13] "[新聞] 陸委會:兩岸關係就是兩岸關係"                        
#> [14] "Re: [問卦] 大家這樣害obov失業怎麼辦?"                      
#> [15] "[新聞] 施政一周 「政策看報才知道」 林全拜碼頭"              
#> [16] "[問卦] 有沒有KENZO的八卦?"                                 
#> [17] "[問卦] 美國哪個州最天龍"                                    
#> [18] "[公告] 八卦板板規(2016.02.16)"                              
#> [19] "[協尋]行車記錄器-5/14承德路5段與基河路交叉口"               
#> [20] "[公告] 數字類跟風問卦視為鬧板"                              
#> [21] "[協尋] 5/27早6點台中五權南路行車記錄"                       
#> [22] "[公告] 五月份置底閒聊文"

Cookies

set_cookies(a = 1, b = 2)
set_cookies(.cookies = c(a = "1", b = "2"))
library(httr)
url <- "https://www.ptt.cc/bbs/Gossiping/index.html"
res <- GET(url, 
           set_cookies('over18'='1'))  # over18 cookie

The response status code

url <- "https://www.ptt.cc/bbs/Gossiping/index.html"
res <- GET(url, 
           set_cookies('over18'='1'))  # over18 cookie

# Get an informative description:
http_status(res)
#> $category
#> [1] "Success"
#> 
#> $reason
#> [1] "OK"
#> 
#> $message
#> [1] "Success: (200) OK"

# Or just access the raw code:
res$status_code
#> [1] 200
status_code(res)
#> [1] 200

# It is highly recommended to call one of these functions whenever you use httr inside a function, so that you find out about errors as soon as possible.
warn_for_status(res)
stop_for_status(res)
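
One way to follow that advice is to wrap the request in a helper. This is a sketch only: `fetch_page` is a hypothetical name, and the block defines the function without sending any request.

```r
# Hypothetical helper: fetch a page and fail fast on HTTP errors.
fetch_page <- function(url, ...) {
  res <- httr::GET(url, ...)       # extra arguments (cookies, headers) are passed through
  httr::stop_for_status(res)       # turns a 4xx/5xx response into an R error
  httr::content(res, as = "text", encoding = "UTF-8")
}

# Example call (not run here because it needs network access):
# html <- fetch_page("https://www.ptt.cc/bbs/Gossiping/index.html",
#                    httr::set_cookies(over18 = "1"))
```

Because the helper raises an error on a bad status, it combines naturally with tryCatch when crawling many pages.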

Set header

Sometimes you may need to provide appropriate HTTP header fields with add_headers() to make a request.

## Not run
add_headers(a = 1, b = 2)
add_headers(.headers = c(a = "1", b = "2"))

Connection: POST method

Starter template

## Not Run
library(httr)
library(rvest)

res <- POST(url = "http://httpbin.org/post",
            add_headers(a = 1, b = 2),
            set_cookies(c = 1, d = 2),
            body = "x=hello&y=hihi")  # raw string (values must already be URL-encoded)

res <- POST(url = "http://httpbin.org/post",
            add_headers(a = 1, b = 2),
            set_cookies(c = 1, d = 2),
            body = list(x = "hello", 
                        y = "hihi"), # form data as list
            encode = "form")

content(res, as = "text", encoding = "UTF-8")
content(res, as = "parsed", encoding = "UTF-8")

Learn your second crawler from one example: Guestbook

Guestbook

Exercise: Guestbook

Try to post a message in App Engine GuestBook

10-min break

Let’s see what we got in the response body

The Response Body

There are 3 ways to access the body of the response with httr::content(): as = "raw", as = "text", and as = "parsed".

httr::content()

res <- GET("https://www.ptt.cc/bbs/NBA/M.1463302851.A.8D7.html")
(bin <- content(res, as = "raw")) %>% head(200) # raw (binary) vector
#>   [1] 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 3e 0a
#>  [24] 09 3c 68 65 61 64 3e 0a 09 09 3c 6d 65 74 61 20 63 68 61 72 73 65 74
#>  [47] 3d 22 75 74 66 2d 38 22 3e 0a 09 09 0a 0a 3c 6d 65 74 61 20 6e 61 6d
#>  [70] 65 3d 22 76 69 65 77 70 6f 72 74 22 20 63 6f 6e 74 65 6e 74 3d 22 77
#>  [93] 69 64 74 68 3d 64 65 76 69 63 65 2d 77 69 64 74 68 2c 20 69 6e 69 74
#> [116] 69 61 6c 2d 73 63 61 6c 65 3d 31 22 3e 0a 0a 3c 74 69 74 6c 65 3e 5b
#> [139] e8 a8 8e e8 ab 96 5d 20 e6 9e 97 e6 9b b8 e8 b1 aa e5 a6 82 e6 9e 9c
#> [162] e5 8a a0 e5 85 a5 e9 a6 ac e5 88 ba e8 83 bd e5 be 97 e5 88 b0 e5 a4
#> [185] 9a e5 b0 91 e4 b8 8a e5 a0 b4 e6 99 82 e9 96 93
# writeBin(bin, "myfile.txt") # the highest fidelity way of saving files to disk
content(res, as = "text", encoding = "UTF-8") %>%  # accesses the body as a character vector
  `Encoding<-`("UTF-8")
#> [1] "<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<meta charset=\"utf-8\">\n\t\t\n\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n\n<title>[討論] 林書豪如果加入馬刺能得到多少上場時間? - 看板 NBA - 批踢踢實業坊</title>\n<meta name=\"robots\" content=\"all\">\n<meta name=\"keywords\" content=\"Ptt BBS 批踢踢\">\n<meta name=\"description\" content=\"\n\n\n最近有傳言說林書豪可能會加入馬刺\n\n\">\n<meta property=\"og:site_name\" content=\"Ptt 批踢踢實業坊\">\n<meta property=\"og:title\" content=\"[討論] 林書豪如果加入馬刺能得到多少上場時間?\">\n<meta property=\"og:description\" content=\"\n\n\n最近有傳言說林書豪可能會加入馬刺\n\n\">\n<link rel=\"canonical\" href=\"https://www.ptt.cc/bbs/NBA/M.1463302851.A.8D7.html\">\n\n<link rel=\"stylesheet\" type=\"text/css\" href=\"//images.ptt.cc/v2.17/bbs-common.css\">\n<link rel=\"stylesheet\" type=\"text/css\" href=\"//images.ptt.cc/v2.17/bbs-base.css\" media=\"screen\">\n<link rel=\"stylesheet\" type=\"text/css\" href=\"//images.ptt.cc/v2.17/bbs-custom.css\">\n<link rel=\"stylesheet\" type=\"text/css\" href=\"//images.ptt.cc/v2.17/pushstream.css\" media=\"screen\">\n<link rel=\"stylesheet\" type=\"text/css\" href=\"//images.ptt.cc/v2.17/bbs-print.css\" media=\"print\">\n\n\n<script src=\"//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js\"></script>\n<script src=\"//images.ptt.cc/v2.17/bbs.js\"></script>\n\n\n\t\t\n\n<script type=\"text/javascript\">\n\n  var _gaq = _gaq || [];\n  _gaq.push(['_setAccount', 'UA-32365737-1']);\n  _gaq.push(['_setDomainName', 'ptt.cc']);\n  _gaq.push(['_trackPageview']);\n\n  (function() {\n    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n    ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n  })();\n\n</script>\n\n\n\t</head>\n    <body>\n\t\t\n<div id=\"fb-root\"></div>\n<script>(function(d, s, id) {\nvar js, fjs = d.getElementsByTagName(s)[0];\nif (d.getElementById(id)) return;\njs = d.createElement(s); js.id = id;\njs.src = \"//connect.facebook.net/en_US/all.js#xfbml=1\";\nfjs.parentNode.insertBefore(js, fjs);\n}(document, 'script', 'facebook-jssdk'));</script>\n\n<div id=\"topbar-container\">\n\t<div id=\"topbar\" class=\"bbs-content\">\n\t\t<a id=\"logo\" href=\"/\">批踢踢實業坊</a>\n\t\t<span>&rsaquo;</span>\n\t\t<a class=\"board\" href=\"/bbs/NBA/index.html\"><span class=\"board-label\">看板 </span>NBA</a>\n\t\t<a class=\"right small\" href=\"/about.html\">關於我們</a>\n\t\t<a class=\"right small\" href=\"/contact.html\">聯絡資訊</a>\n\t</div>\n</div>\n<div id=\"navigation-container\">\n\t<div id=\"navigation\" class=\"bbs-content\">\n\t\t<a class=\"board\" href=\"/bbs/NBA/index.html\">返回看板</a>\n\t\t<div class=\"bar\"></div>\n\t\t<div class=\"share\">\n\t\t\t<span>分享</span>\n\t\t\t<div class=\"fb-like\" data-send=\"false\" data-layout=\"button_count\" data-width=\"90\" data-show-faces=\"false\" data-href=\"http://www.ptt.cc/bbs/NBA/M.1463302851.A.8D7.html\"></div>\n\n\t\t\t<div class=\"g-plusone\" data-size=\"medium\"></div>\n<script type=\"text/javascript\">\nwindow.___gcfg = {lang: 'zh-TW'};\n(function() {\nvar po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;\npo.src = 'https://apis.google.com/js/plusone.js';\nvar s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);\n})();\n</script>\n\n\t\t</div>\n\t</div>\n</div>\n<div id=\"main-container\">\n    <div id=\"main-content\" class=\"bbs-screen bbs-content\"><div class=\"article-metaline\"><span class=\"article-meta-tag\">作者</span><span class=\"article-meta-value\">magneto5566 
(萬磁王5566_麥寮法斯賓達)</span></div><div class=\"article-metaline-right\"><span class=\"article-meta-tag\">看板</span><span class=\"article-meta-value\">NBA</span></div><div class=\"article-metaline\"><span class=\"article-meta-tag\">標題</span><span class=\"article-meta-value\">[討論] 林書豪如果加入馬刺能得到多少上場時間?</span></div><div class=\"article-metaline\"><span class=\"article-meta-tag\">時間</span><span class=\"article-meta-value\">Sun May 15 17:00:48 2016</span></div>\n\n\n最近有傳言說林書豪可能會加入馬刺\n\n不管對林書豪或對球迷來說馬刺都是個很棒的選擇\n\n馬刺今年例行賽戰績高達67勝,主力陣容維持的話明年應該依舊有60勝左右\n\n雖然波波維奇用兵如神,但林書豪在馬刺到底能獲得多少上場時間呢\n\n林書豪應該是替補PG和SG的位置\n\n單看馬刺和雷霆的系列賽\n\n馬刺的後場部分主要由4位球員分擔時間\n\n先發: Danny Green , Tony Parker\n\n替補: Manu Ginobili, Patty Mills\n\n\n理論上 林書豪最有機會的是Ginobili的位置\n\n畢竟鬼切已經38歲,下季會不會回來都還不確定\n\n林書豪的切入能力雖然沒有巔峰的鬼切犀利\n\n但是大勝38歲鬼切應該是沒問題\n\n相信波波維奇很了解這一點\n\n不知道大家怎麼看呢?\n\n\n--\n<span class=\"f2\">※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 140.112.71.9\n</span><span class=\"f2\">※ 文章網址: <a href=\"https://www.ptt.cc/bbs/NBA/M.1463302851.A.8D7.html\" target=\"_blank\" rel=\"nofollow\">https://www.ptt.cc/bbs/NBA/M.1463302851.A.8D7.html</a>\n</span><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">sss1234     </span><span class=\"f3 push-content\">: 87分鐘</span><span class=\"push-ipdatetime\"> 05/15 17:02\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">h22212247888</span><span class=\"f3 push-content\">: 8.7分鐘</span><span class=\"push-ipdatetime\"> 05/15 17:02\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">rick        </span><span class=\"f3 push-content\">: 等七月再來分析好嗎?</span><span class=\"push-ipdatetime\"> 05/15 17:02\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">demonh311   </span><span class=\"f3 push-content\">: 當然是當基石</span><span class=\"push-ipdatetime\"> 05/15 17:02\n</span></div><div class=\"push\"><span class=\"f1 
hl push-tag\">→ </span><span class=\"f3 hl push-userid\">arcss       </span><span class=\"f3 push-content\">: 前半年冰凍,後半年15分鐘/場</span><span class=\"push-ipdatetime\"> 05/15 17:03\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">hunt5566    </span><span class=\"f3 push-content\">: 呵呵 波波又藏招不讓豪鬼上囉</span><span class=\"push-ipdatetime\"> 05/15 17:04\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">Xenogamer   </span><span class=\"f3 push-content\">: 0</span><span class=\"push-ipdatetime\"> 05/15 17:04\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">love1500274 </span><span class=\"f3 push-content\">: Pop會讓書豪取代跑車先發 書豪會拿下年度mvp</span><span class=\"push-ipdatetime\"> 05/15 17:05\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">littlegreen </span><span class=\"f3 push-content\">: 馬刺需要切入沒錯 但是切傳能力 這書豪比較弱</span><span class=\"push-ipdatetime\"> 05/15 17:06\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">r30385      </span><span class=\"f3 push-content\">: 8.7秒</span><span class=\"push-ipdatetime\"> 05/15 17:08\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">donnylee    </span><span class=\"f3 push-content\">: Mills也差不多要退了,40歲</span><span class=\"push-ipdatetime\"> 05/15 17:12\n</span></div>         Mills 27歲而已\n<span class=\"f2\">※ 編輯: magneto5566 (140.112.71.9), 05/15/2016 17:14:25\n</span><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">darkfeather </span><span class=\"f3 push-content\">: 美媒建議變傳言,這進化速度還真快</span><span class=\"push-ipdatetime\"> 05/15 17:15\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">donnylee    </span><span class=\"f3 
push-content\">: 喔,我記錯了,那是哪一個40?</span><span class=\"push-ipdatetime\"> 05/15 17:18\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">Mooooose    </span><span class=\"f3 push-content\">: Popo黑掉的所花的時間會比較多嗎?</span><span class=\"push-ipdatetime\"> 05/15 17:18\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">erotica     </span><span class=\"f3 push-content\">: 阿豪打1 2號都可以  還可以各種大小鎖防守</span><span class=\"push-ipdatetime\"> 05/15 17:21\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">king181239  </span><span class=\"f3 push-content\">: 去專版問啊</span><span class=\"push-ipdatetime\"> 05/15 17:21\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">king181239  </span><span class=\"f3 push-content\">: 他們會說超過35分鐘</span><span class=\"push-ipdatetime\"> 05/15 17:21\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">erotica     </span><span class=\"f3 push-content\">: 屌打上述4個人</span><span class=\"push-ipdatetime\"> 05/15 17:21\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">king181239  </span><span class=\"f3 push-content\">: 事實上只能出賽20幾</span><span class=\"push-ipdatetime\"> 05/15 17:22\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">s925407     </span><span class=\"f3 push-content\">: 怎麼會有人說Mills40歲啦,一日球迷也沒這麼不專業</span><span class=\"push-ipdatetime\"> 05/15 17:22\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">s925407     </span><span class=\"f3 push-content\">: 啦</span><span class=\"push-ipdatetime\"> 05/15 17:22\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">a10141013   
</span><span class=\"f3 push-content\">: 馬刺大約20-25分鐘 然後popo被說冰箱</span><span class=\"push-ipdatetime\"> 05/15 17:24\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">dream1285201</span><span class=\"f3 push-content\">: 一個球員會黑掉大多跟他的球迷有關</span><span class=\"push-ipdatetime\"> 05/15 17:24\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">smtp        </span><span class=\"f3 push-content\">: 重點是馬刺想出多少爭取豪哥?</span><span class=\"push-ipdatetime\"> 05/15 17:28\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">kl015013    </span><span class=\"f3 push-content\">: 20</span><span class=\"push-ipdatetime\"> 05/15 17:29\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">bycarbird   </span><span class=\"f3 push-content\">: 0,因為書豪不是波波的菜</span><span class=\"push-ipdatetime\"> 05/15 17:30\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">MoMovincent </span><span class=\"f3 push-content\">: 那個說40歲的應該是要說a米吧</span><span class=\"push-ipdatetime\"> 05/15 17:30\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">s95115260   </span><span class=\"f3 push-content\">: 馬刺怎麼可能要球權的球員</span><span class=\"push-ipdatetime\"> 05/15 17:31\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">love1500274 </span><span class=\"f3 push-content\">: 他想說的是Miller 打成Mills  lol</span><span class=\"push-ipdatetime\"> 05/15 17:32\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">kiwi1220    </span><span class=\"f3 push-content\">: 40是 ,Miller</span><span class=\"push-ipdatetime\"> 05/15 17:33\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl 
push-userid\">miler22020  </span><span class=\"f3 push-content\">: 20分鐘 球權不多 然後popo從此黑掉</span><span class=\"push-ipdatetime\"> 05/15 17:36\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">scatology   </span><span class=\"f3 push-content\">: 5 垃圾時間</span><span class=\"push-ipdatetime\"> 05/15 17:36\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">smtp        </span><span class=\"f3 push-content\">: 10M簽的話, 上場時間不可能只有20分鐘...</span><span class=\"push-ipdatetime\"> 05/15 17:37\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">forker561   </span><span class=\"f3 push-content\">: 第一年不會有上場時間 最快第二年才有上場時間</span><span class=\"push-ipdatetime\"> 05/15 17:38\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">candy9999   </span><span class=\"f3 push-content\">: 去哪都比在黃蜂好</span><span class=\"push-ipdatetime\"> 05/15 17:40\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">DiAbLoE     </span><span class=\"f3 push-content\">: popo會變黑人</span><span class=\"push-ipdatetime\"> 05/15 17:41\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">smtp        </span><span class=\"f3 push-content\">: 豪哥今年曾說,年薪影響別人對你評價,可見企圖心很強</span><span class=\"push-ipdatetime\"> 05/15 17:41\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">happyalex   </span><span class=\"f3 push-content\">: 公道價8分鐘</span><span class=\"push-ipdatetime\"> 05/15 17:43\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">goury       </span><span class=\"f3 push-content\">: 千萬不要去馬刺...Linsanity的重點是明星度,而不是</span><span class=\"push-ipdatetime\"> 05/15 17:47\n</span></div><div class=\"push\"><span 
class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">goury       </span><span class=\"f3 push-content\">: 馬刺這樣只在乎勝利,讓買票進場的人看不到球星的隊</span><span class=\"push-ipdatetime\"> 05/15 17:48\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">loverxa     </span><span class=\"f3 push-content\">: 到時不用豪神 就沒人在跟你用兵如神了啦</span><span class=\"push-ipdatetime\"> 05/15 17:49\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">crazywill   </span><span class=\"f3 push-content\">: 馬刺不會要阿豪 不用討論了</span><span class=\"push-ipdatetime\"> 05/15 17:50\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">birdman5656 </span><span class=\"f3 push-content\">: 先等波波帶完奧運吧</span><span class=\"push-ipdatetime\"> 05/15 17:51\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">yyyyy       </span><span class=\"f3 push-content\">: 馬刺應該以我豪為核心進行重建才對</span><span class=\"push-ipdatetime\"> 05/15 17:53\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">leocheng    </span><span class=\"f3 push-content\">: 馬刺喜歡跳投好的 怎麼可能愛豪鬼</span><span class=\"push-ipdatetime\"> 05/15 18:04\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">june0204    </span><span class=\"f3 push-content\">: 還是2000找康利吧</span><span class=\"push-ipdatetime\"> 05/15 18:12\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">HowWhy99    </span><span class=\"f3 push-content\">: 8.7分 不能再多了</span><span class=\"push-ipdatetime\"> 05/15 18:12\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">crazylin924 </span><span class=\"f3 push-content\">: 35分鐘 (一整季</span><span class=\"push-ipdatetime\"> 05/15 
18:19\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">xman262     </span><span class=\"f3 push-content\">: 487秒</span><span class=\"push-ipdatetime\"> 05/15 18:24\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">ppo7741     </span><span class=\"f3 push-content\">: 8.7分,跟這篇文的分數一樣,不能再高了</span><span class=\"push-ipdatetime\"> 05/15 18:25\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">carey1119   </span><span class=\"f3 push-content\">: 8.7分鐘</span><span class=\"push-ipdatetime\"> 05/15 18:33\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">DurantKevin </span><span class=\"f3 push-content\">: 。。。</span><span class=\"push-ipdatetime\"> 05/15 18:34\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">qq1029qq    </span><span class=\"f3 push-content\">: 87分鐘無誤</span><span class=\"push-ipdatetime\"> 05/15 18:40\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">drew9992    </span><span class=\"f3 push-content\">: <a href=\"http://i.imgur.com/0XBGyQU.jpg\" target=\"_blank\" rel=\"nofollow\">http://i.imgur.com/0XBGyQU.jpg</a></span><span class=\"push-ipdatetime\"> 05/15 19:01\n</span></div><div class=\"richcontent\"><img src=\"//i.imgur.com/0XBGyQU.jpg\" alt=\"\" /></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">dai26       </span><span class=\"f3 push-content\">: 若能完全取代馬妞,大概30分鐘,而且可打滿第四節</span><span class=\"push-ipdatetime\"> 05/15 19:20\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">vicyong     </span><span class=\"f3 push-content\">: 不要在幻想了</span><span class=\"push-ipdatetime\"> 05/15 19:23\n</span></div><div 
class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">gotohikaru  </span><span class=\"f3 push-content\">: 第一年不用想</span><span class=\"push-ipdatetime\"> 05/15 19:32\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">waw20002002 </span><span class=\"f3 push-content\">: 8.7分鐘!不能在多了</span><span class=\"push-ipdatetime\"> 05/15 20:13\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">an565an565  </span><span class=\"f3 push-content\">: 後衛來馬刺都要被冰一整季</span><span class=\"push-ipdatetime\"> 05/15 21:07\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">trylovetom  </span><span class=\"f3 push-content\">: Parker不是被豪哥哥電?</span><span class=\"push-ipdatetime\"> 05/15 21:09\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">smtp        </span><span class=\"f3 push-content\">: Parker是輸在年紀, 巔峰時不可能會被豪哥電~</span><span class=\"push-ipdatetime\"> 05/15 21:12\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">Yginger1    </span><span class=\"f3 push-content\">: 42分鐘 全場打好打滿</span><span class=\"push-ipdatetime\"> 05/15 21:19\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">pickoff     </span><span class=\"f3 push-content\">: 馬刺不收。問下一隊謝謝</span><span class=\"push-ipdatetime\"> 05/15 21:22\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">mmshlovebob </span><span class=\"f3 push-content\">: 8.7分鐘 不能再多了</span><span class=\"push-ipdatetime\"> 05/15 21:36\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">roc074      </span><span class=\"f3 push-content\">: 8.7時</span><span class=\"push-ipdatetime\"> 05/15 
22:04\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">噓 </span><span class=\"f3 hl push-userid\">deathgale   </span><span class=\"f3 push-content\">: 等加盟了再來討論,記者都自以為球團老闆...</span><span class=\"push-ipdatetime\"> 05/15 22:38\n</span></div><div class=\"push\"><span class=\"hl push-tag\">推 </span><span class=\"f3 hl push-userid\">squall410339</span><span class=\"f3 push-content\">: 快去啦馬刺黑掉就是爽</span><span class=\"push-ipdatetime\"> 05/15 23:29\n</span></div><div class=\"push\"><span class=\"f1 hl push-tag\">→ </span><span class=\"f3 hl push-userid\">raku        </span><span class=\"f3 push-content\">: 一開始一場應該是不到10分鐘...</span><span class=\"push-ipdatetime\"> 05/16 00:34\n</span></div></div>\n    \n    <div id=\"article-polling\" data-pollurl=\"/poll/NBA/M.1463302851.A.8D7.html?cacheKey=2052-14787853&offset=7844&offset-sig=05b0d881a2af63a68de9ce2d0f1cda56eec2d548\" data-longpollurl=\"/v1/longpoll?id=5ef93f8aa82d4cf930eb27310d147571cf13936a\" data-offset=\"7844\"></div>\n    \n\n    \n</div>\n\n    </body>\n</html>\n"
content(res, as = "parsed")
#> {xml_document}
#> <html>
#> [1] <head>\n\t\t<meta charset="utf-8"/>\n\t\t\n\n<meta name="viewport" c ...
#> [2] <body>\n\t\t\n<div id="fb-root"/>\n<script><![CDATA[(function(d, s,  ...

Beware of System Encoding

Encoding

?locales ?Encoding ?iconv iconvlist()

## check out your system locale
Sys.getlocale()
#> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
aa <- "你好嗎"
Encoding(aa)
#> [1] "UTF-8"
charToRaw(aa)
#> [1] e4 bd a0 e5 a5 bd e5 97 8e

(aa_big5 <- iconv(aa, from = "UTF-8", to = "Big5"))
#> [1] "\xa7A\xa6n\xb6\xdc"
Encoding(aa_big5)
#> [1] "unknown"
charToRaw(aa_big5)
#> [1] a7 41 a6 6e b6 dc
# On Windows
Sys.getlocale()
#> "CP950"
aa <- "你好嗎"
Encoding(aa)
#> "unknown"
aa_utf8 <- iconv(aa, from = "Big5", to = "UTF-8")
Encoding(aa_utf8)
#> "UTF-8"
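The conversions above can be checked as a round trip. This is a minimal sketch, assuming the iconv on your platform supports Big5; the string is written with \u escapes so the result does not depend on how the script file itself is encoded.

```r
# "\u4f60\u597d\u55ce" is "你好嗎"; a string built from \u escapes is marked as UTF-8
aa <- "\u4f60\u597d\u55ce"

# UTF-8 -> Big5: same characters, different bytes (2 bytes per character)
aa_big5 <- iconv(aa, from = "UTF-8", to = "Big5")
charToRaw(aa_big5)
#> [1] a7 41 a6 6e b6 dc

# Big5 -> UTF-8: the original bytes come back
aa_back <- iconv(aa_big5, from = "Big5", to = "UTF-8")
identical(charToRaw(aa_back), charToRaw(aa))
#> [1] TRUE
```

Comparing raw bytes rather than printed strings avoids being misled by how the console renders each encoding.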

The secret of URL

URL?par1=val1&par2=val2

Query String

You can pass query parameters through the query argument of GET()

res1 <- GET(
  "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=sony&page=1&sort=rnk/dc"
)

res2 <- GET("http://ecshweb.pchome.com.tw/search/v3.3/all/results",
            query = list(q="sony", page="1", sort="rnk/dc"))
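To see what URL a query list turns into without sending a request, one option is httr's modify_url() together with parse_url(). This is a sketch rather than part of the original slides; the variable names are only for illustration.

```r
library(httr)

base_url <- "http://ecshweb.pchome.com.tw/search/v3.3/all/results"

# Build the full URL from a named list; no request is sent
built <- modify_url(base_url,
                    query = list(q = "sony", page = "1", sort = "rnk/dc"))
built

# Parse the URL back into its parts to confirm the parameters survived
parsed <- parse_url(built)
parsed$query$q
#> [1] "sony"
```

Letting httr assemble the query string means special characters in the values are escaped for you.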

Concatenate strings / String formatting

q_string = "apple"
paste0("hihi", q_string)
#> [1] "hihiapple"
paste0("hihi", q_string, 1:3)  # recycling to length 3
#> [1] "hihiapple1" "hihiapple2" "hihiapple3"
paste("hihi", q_string, 1:3, sep = " ", collapse = ",")
#> [1] "hihi apple 1,hihi apple 2,hihi apple 3"
sprintf("%s, %s", "hihi", q_string) # often used to construct a URL
#> [1] "hihi, apple"
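Because sprintf() is vectorised, the same template can produce one URL per page. A small sketch, reusing the PChome search URL from the query-string slide; the page range 1:3 is an arbitrary choice for illustration.

```r
base_url <- "http://ecshweb.pchome.com.tw/search/v3.3/all/results"

# %s takes strings, %d takes integers; 1:3 is recycled into three URLs
urls <- sprintf("%s?q=%s&page=%d&sort=%s", base_url, "sony", 1:3, "rnk/dc")
urls[1]
#> [1] "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=sony&page=1&sort=rnk/dc"
length(urls)
#> [1] 3
```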

URL Encoding in R

URLencode(" ")  # a space
#> [1] "%20"
q_string = "蘋果電腦"
(q_string_enc = URLencode(q_string))
#> [1] "%E8%98%8B%E6%9E%9C%E9%9B%BB%E8%85%A6"
URLdecode(q_string_enc)
#> [1] "蘋果電腦"
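By default URLencode() leaves reserved characters such as "/" and "&" alone, which is fine for a whole URL but not for a single query value. The sketch below shows the reserved argument; the sample value "rnk/dc&x" is made up for illustration.

```r
# Default: reserved characters are left untouched
URLencode("rnk/dc&x")
#> [1] "rnk/dc&x"

# reserved = TRUE: also escape "/" and "&", safe for one query value
(enc <- URLencode("rnk/dc&x", reserved = TRUE))
#> [1] "rnk%2Fdc%26x"

URLdecode(enc)
#> [1] "rnk/dc&x"
```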



Parsing Data

Response Content

What to Parse

Structured data

Unstructured data

Parsing Response Content: Text Type

rvest

A web scraper designed to work with magrittr.

Use with rvest

GET(url) %>% content(as="text") %>% rvest::read_html()

Extract elements from HTML document

Tree Structure of HTML

(Document Object Model, DOM)

<a href = "www.meetup.com/Taiwan-R">
  Taiwan R User Group Website
</a>

A simple HTML document

demo_page

library(magrittr)
doc = readLines("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html", encoding = "UTF-8") %>%
    paste(collapse = "\n")
cat(doc)
#> <!DOCTYPE HTML>
#> <html lang="zh-TW">
#> <head>
#>     <meta charset="UTF-8">
#>     <title>Document</title>
#> </head>
#> <html>
#>     <body>
#>         <div id='title' class='character'>
#>             <a class="link" href="http://data-sci.info/r-crawler-101/">data-sci.info</a>
#>         </div>
#>         <div id='summary' class='character'>
#>             <span class='title'>Number:</span>
#>             <span class='number'>2</span>
#>         </div>
#>         <div id='table1' class='info'>
#>             <table style="border-collapse: collapse; border: 1px solid black;">
#>                 <tr>
#>                     <th>Name</th>
#>                     <th>Gender</th>
#>                     <th>Age</th>
#>                 </tr>
#>                 <tr>
#>                     <td>Alice</td>
#>                     <td>Female</td>
#>                     <td>24</td>
#>                 </tr>
#>                 <tr>
#>                     <td>Jane</td>
#>                     <td>Female</td>
#>                     <td>26</td>
#>                 </tr>
#>             </table>
#>         </div>
#> 
#>         <h3>傷心排行榜</h3>
#>         <div id='table2' class='info'>
#>             <table style="border-collapse: collapse; border: 1px solid black;">
#>                 <tr>
#>                     <th>姓名</th>
#>                     <th>年齡</th>
#>                 </tr>
#>                 <tr>
#>                     <td>白素貞</td>
#>                     <td>15</td>
#>                 </tr>
#>                 <tr>
#>                     <td>聶小倩</td>
#>                     <td>32</td>
#>                 </tr>
#>                 <tr>
#>                     <td>祝英台</td>
#>                     <td>54</td>
#>                 </tr>
#>             </table>
#>         </div>
#>     </body>
#> </html>

Create HTML document object

library(rvest)
doc <- GET("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html") %>% 
  content(as = "text", encoding = "UTF-8") %>% 
  read_html()
doc
#> {xml_document}
#> <html>
#> [1] <head>\n    <meta charset="UTF-8"/>\n    <title>Document</title>\n</ ...
#> [2] <body>\n        <div id="title" class="character">\n            <a c ...
class(doc)
#> [1] "xml_document" "xml_node"

Extract with CSS selector

CSS practice: http://flukeout.github.io/

css-diner-plate

doc <- read_html("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html")
doc %>% 
  html_nodes(css = ".character") %>% # a node set
  as.character()
#> [1] "<div id=\"title\" class=\"character\">\n            <a class=\"link\" href=\"http://data-sci.info/r-crawler-101/\">data-sci.info</a>\n        </div>"   
#> [2] "<div id=\"summary\" class=\"character\">\n            <span class=\"title\">Number:</span>\n            <span class=\"number\">2</span>\n        </div>"
doc %>% 
  html_nodes(css = "#title > .link") %>% 
  as.character()
#> [1] "<a class=\"link\" href=\"http://data-sci.info/r-crawler-101/\">data-sci.info</a>"

html_node() vs html_nodes()
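The difference is easiest to see on a tiny document: html_nodes() returns every match as a node set, while html_node() returns only the first match. The inline HTML below is a made-up snippet, not the course demo page.

```r
library(rvest)

# A tiny inline document with two matching elements
snippet <- read_html("<html><body>
  <div class='character'>first</div>
  <div class='character'>second</div>
</body></html>")

# html_nodes(): all matches
all_divs <- html_nodes(snippet, css = ".character")
length(all_divs)
#> [1] 2

# html_node(): the first match only
first_div <- html_node(snippet, css = ".character")
html_text(first_div)
#> [1] "first"
```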

Extract with XPath

doc <- read_html("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html")
doc %>% 
  html_nodes(xpath = "//*[@class='character']") %>% 
  as.character()
#> [1] "<div id=\"title\" class=\"character\">\n            <a class=\"link\" href=\"http://data-sci.info/r-crawler-101/\">data-sci.info</a>\n        </div>"   
#> [2] "<div id=\"summary\" class=\"character\">\n            <span class=\"title\">Number:</span>\n            <span class=\"number\">2</span>\n        </div>"
doc %>% 
  html_nodes(xpath = "//div[@id='title']/a") %>% 
  as.character()
#> [1] "<a class=\"link\" href=\"http://data-sci.info/r-crawler-101/\">data-sci.info</a>"

Extract text

doc %>% 
  html_nodes(xpath = "//*[@class='character']") %>% 
  html_text()
#> [1] "\n            data-sci.info\n        "         
#> [2] "\n            Number:\n            2\n        "
doc %>% 
  html_nodes(xpath = "//div[@id='title']/a") %>% 
  html_text()
#> [1] "data-sci.info"
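The leading and trailing whitespace in the first result comes from the indentation in the HTML source. A minimal sketch of removing it with the trim argument of html_text(), using an inline snippet modelled on the demo page:

```r
library(rvest)

# Inline copy of the "title" block from the demo page
snippet <- read_html("<html><body>
  <div id='title' class='character'>
      <a class='link' href='http://data-sci.info/r-crawler-101/'>data-sci.info</a>
  </div>
</body></html>")

title_div <- html_nodes(snippet, css = "#title")

raw_text  <- html_text(title_div)               # keeps surrounding whitespace
trim_text <- html_text(title_div, trim = TRUE)  # strips it
trim_text
#> [1] "data-sci.info"
```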

Extract name of tag

node = doc %>% 
    html_nodes(css = "#summary") %>% 
    html_name
node
#> [1] "div"

Exercise: Scrape the push comments of a PTT post

https://www.ptt.cc/bbs/Gossiping/M.1464355692.A.0E5.html


## 10-min break

Example: PTT

Use the xmlview package to parse XML content and extract elements.

Extract Table Using rvest::html_table()

(html_table() is temporarily not working with non-ASCII data on Windows.) demo_page

students <- read_html("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html") %>% 
    # html_nodes(css = "#table1>table") %>%  # also can select table first
    html_table()
students
#> [[1]]
#>    Name Gender Age
#> 1 Alice Female  24
#> 2  Jane Female  26
#> 
#> [[2]]
#>     姓名 年齡
#> 1 白素貞   15
#> 2 聶小倩   32
#> 3 祝英台   54
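When a page holds several tables, selecting one table node first gives back a single table instead of a list. This is a sketch on an inline snippet; the second table and the name students1 are placeholders added for illustration.

```r
library(rvest)

page <- read_html("<html><body>
  <div id='table1'><table>
    <tr><th>Name</th><th>Gender</th><th>Age</th></tr>
    <tr><td>Alice</td><td>Female</td><td>24</td></tr>
    <tr><td>Jane</td><td>Female</td><td>26</td></tr>
  </table></div>
  <div id='table2'><table>
    <tr><th>key</th></tr>
    <tr><td>value</td></tr>
  </table></div>
</body></html>")

# Pick exactly one table node, then convert only that table
students1 <- page %>%
  html_node(css = "#table1 > table") %>%
  html_table()

as.character(students1$Name)
#> [1] "Alice" "Jane"
```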

Extract Table Using XML::readHTMLTable()

demo_page

res_text <- GET("http://leoluyi.github.io/RCrawler101_201605_Week2/resources/data/demo.html") %>% 
  content("text", encoding = "UTF-8") %>% 
  `Encoding<-`("UTF-8")

# read and extract tables from html string
tables <- res_text %>% XML::readHTMLTable(encoding = "UTF-8")
str(tables) # take a look 
View(tables[[2]]) # take a look 
tables[[2]]
#>     姓名 年齡
#> 1 白素貞   15
#> 2 聶小倩   32
#> 3 祝英台   54

Example: Yahoo Stock

Example: Market Observation Post System (公開資訊觀測站)

Parse XML table

Parse XML table with XML::xmlToDataFrame()

Example

Parse JSON format

You can parse JSON with jsonlite in R


Parse JSON with jsonlite

library(jsonlite)
library(magrittr)
res <- GET("http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=sony&page=1&sort=rnk/dc")
res_df <- content(res, as = "text", encoding = "UTF-8") %>% 
  `Encoding<-`("UTF-8") %>% 
  fromJSON() %>%
  .$prods  # equivalent to (function(x) {x$prods})
head(res_df)
#>                 Id cateId
#> 1 DPAE03-A9006RR9X DPAE03
#> 2 DYAD2P-A9006JOZX DYAD2P
#> 3 DGAF3L-A9005R0IB DGAF3L
#> 4 DYAD2O-A9006JDQE DYAD2O
#> 5 DYAD2R-A900776LD DYAD2R
#> 6 DYAD2R-A9006XKSI DYAD2R
#>                                                                          picS
#> 1 /pic/v1/data/item/201601/D/P/A/E/0/3/sDPAE03-A9006RR9X000_56970708b9924.jpg
#> 2 /pic/v1/data/item/201510/D/Y/A/D/2/P/sDYAD2P-A9006JOZX000_5614b4cc793f6.jpg
#> 3 /pic/v1/data/item/201603/D/G/A/F/3/L/sDGAF3L-A9005R0IB000_56e13e653c0a5.jpg
#> 4 /pic/v1/data/item/201510/D/Y/A/D/2/O/sDYAD2O-A9006JDQE000_560e037b08db2.jpg
#> 5 /pic/v1/data/item/201604/D/Y/A/D/2/R/sDYAD2R-A900776LD000_5718b07ab2c7c.jpg
#> 6 /pic/v1/data/item/201604/D/Y/A/D/2/R/sDYAD2R-A9006XKSI000_56fe3f0b20c2e.jpg
#>                                                                         picB
#> 1 /pic/v1/data/item/201605/D/P/A/E/0/3/DPAE03-A9006RR9X000_5747a5483899f.jpg
#> 2 /pic/v1/data/item/201510/D/Y/A/D/2/P/DYAD2P-A9006JOZX000_5614b4cc76ca5.jpg
#> 3 /pic/v1/data/item/201605/D/G/A/F/3/L/DGAF3L-A9005R0IB000_5747aa8e441fe.jpg
#> 4 /pic/v1/data/item/201604/D/Y/A/D/2/O/DYAD2O-A9006JDQE000_57198b2dc111b.jpg
#> 5 /pic/v1/data/item/201605/D/Y/A/D/2/R/DYAD2R-A900776LD000_57426f72542d3.jpg
#> 6 /pic/v1/data/item/201605/D/Y/A/D/2/R/DYAD2R-A9006XKSI000_57426fcf0b981.jpg
#>                                            name
#> 1                    Sony 行動微型投影機 MP-CL1
#> 2  Sony Xperia Z5 Compact E5823 4.6吋八核輕旗艦
#> 3 SONY 錄音筆 ICD-PX440 立體音 4GB 【中文平輸】
#> 4                                SONY Xperia Z5
#> 5              SONY Xperia Z5 Premium玫瑰石英粉
#> 6                        SONY Xperia Z5 Premium
#>                                                           describe price
#> 1                                       Sony 行動微型投影機 MP-CL1 11900
#> 2 送9H玻璃保貼+保護套 Sony Xperia Z5 Compact E5823 4.6吋八核輕旗艦 13880
#> 3                  SONY 錄音筆 ICD-PX440 立體音 4GB 【中文平輸】    2880
#> 4     週末超值降▼送保護貼+保護套SONY Xperia Z5 5.2吋美型防水旗艦機 16490
#> 5   ▼送Kitty側掀皮套or32G卡+玻璃貼SONY Xperia Z5 Premium玫瑰石英粉 19999
#> 6                   ▼送32G卡+手機立架+玻璃貼SONY Xperia Z5 Premium 19900
#>   author brand publishDate isPick
#> 1                               0
#> 2                               0
#> 3                               0
#> 4                               0
#> 5                               0
#> 6                               0

httr auto parse json as list

res_list <- content(res, as = "parsed")
# str(res_list)
res_list$prods[[1]]
#> $Id
#> [1] "DPAE03-A9006RR9X"
#> 
#> $cateId
#> [1] "DPAE03"
#> 
#> $picS
#> [1] "/pic/v1/data/item/201601/D/P/A/E/0/3/sDPAE03-A9006RR9X000_56970708b9924.jpg"
#> 
#> $picB
#> [1] "/pic/v1/data/item/201605/D/P/A/E/0/3/DPAE03-A9006RR9X000_5747a5483899f.jpg"
#> 
#> $name
#> [1] "Sony 行動微型投影機 MP-CL1"
#> 
#> $describe
#> [1] "Sony 行動微型投影機 MP-CL1"
#> 
#> $price
#> [1] 11900
#> 
#> $author
#> [1] ""
#> 
#> $brand
#> [1] ""
#> 
#> $publishDate
#> [1] ""
#> 
#> $isPick
#> [1] 0

Combine a list of data frames with a loop

# str(res_list$prods)
res_df2 = data.frame()
for (i in 1:length(res_list$prods)) {
  res_df2 = rbind(res_df2, 
                  data.frame(res_list$prods[[i]], 
                             stringsAsFactors = FALSE))
}
identical(res_df, res_df2)
#> [1] TRUE

(better to use do.call + rbind)

More general way to combine list of dataframes

Combine a list of data frames with

do.call + rbind

## first make a list of dataframes
df_list <- lapply(res_list$prods, as.data.frame, stringsAsFactors = FALSE)
# head(res_list$prods, 3) %>% str
# head(df_list, 3) %>% str
# combine dataframes
res_df3 = do.call(rbind, df_list)
identical(res_df, res_df3)
#> [1] TRUE
# str(res_df3)
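The same two steps can be tried offline on hand-made records. The list below is hypothetical data shaped like res_list$prods, with one named list per product.

```r
# Hypothetical records: one named list per product
prods <- list(
  list(Id = "A-001", name = "item one", price = 100),
  list(Id = "B-002", name = "item two", price = 250)
)

# Step 1: turn every record into a one-row data frame
demo_list <- lapply(prods, as.data.frame, stringsAsFactors = FALSE)

# Step 2: a single call equivalent to rbind(demo_list[[1]], demo_list[[2]])
prods_df <- do.call(rbind, demo_list)
nrow(prods_df)
#> [1] 2
```

Unlike the loop, do.call() passes all the pieces to rbind() at once, so the result is not rebuilt on every iteration.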

data.table::rbindlist

library(data.table)
res_df4 = rbindlist(df_list)
head(res_df4)
#>                  Id cateId
#> 1: DPAE03-A9006RR9X DPAE03
#> 2: DYAD2P-A9006JOZX DYAD2P
#> 3: DGAF3L-A9005R0IB DGAF3L
#> 4: DYAD2O-A9006JDQE DYAD2O
#> 5: DYAD2R-A900776LD DYAD2R
#> 6: DYAD2R-A9006XKSI DYAD2R
#>                                                                           picS
#> 1: /pic/v1/data/item/201601/D/P/A/E/0/3/sDPAE03-A9006RR9X000_56970708b9924.jpg
#> 2: /pic/v1/data/item/201510/D/Y/A/D/2/P/sDYAD2P-A9006JOZX000_5614b4cc793f6.jpg
#> 3: /pic/v1/data/item/201603/D/G/A/F/3/L/sDGAF3L-A9005R0IB000_56e13e653c0a5.jpg
#> 4: /pic/v1/data/item/201510/D/Y/A/D/2/O/sDYAD2O-A9006JDQE000_560e037b08db2.jpg
#> 5: /pic/v1/data/item/201604/D/Y/A/D/2/R/sDYAD2R-A900776LD000_5718b07ab2c7c.jpg
#> 6: /pic/v1/data/item/201604/D/Y/A/D/2/R/sDYAD2R-A9006XKSI000_56fe3f0b20c2e.jpg
#>                                                                          picB
#> 1: /pic/v1/data/item/201605/D/P/A/E/0/3/DPAE03-A9006RR9X000_5747a5483899f.jpg
#> 2: /pic/v1/data/item/201510/D/Y/A/D/2/P/DYAD2P-A9006JOZX000_5614b4cc76ca5.jpg
#> 3: /pic/v1/data/item/201605/D/G/A/F/3/L/DGAF3L-A9005R0IB000_5747aa8e441fe.jpg
#> 4: /pic/v1/data/item/201604/D/Y/A/D/2/O/DYAD2O-A9006JDQE000_57198b2dc111b.jpg
#> 5: /pic/v1/data/item/201605/D/Y/A/D/2/R/DYAD2R-A900776LD000_57426f72542d3.jpg
#> 6: /pic/v1/data/item/201605/D/Y/A/D/2/R/DYAD2R-A9006XKSI000_57426fcf0b981.jpg
#>                                             name
#> 1:                    Sony 行動微型投影機 MP-CL1
#> 2:  Sony Xperia Z5 Compact E5823 4.6吋八核輕旗艦
#> 3: SONY 錄音筆 ICD-PX440 立體音 4GB 【中文平輸】
#> 4:                                SONY Xperia Z5
#> 5:              SONY Xperia Z5 Premium玫瑰石英粉
#> 6:                        SONY Xperia Z5 Premium
#>                                                            describe price
#> 1:                                       Sony 行動微型投影機 MP-CL1 11900
#> 2: 送9H玻璃保貼+保護套 Sony Xperia Z5 Compact E5823 4.6吋八核輕旗艦 13880
#> 3:                  SONY 錄音筆 ICD-PX440 立體音 4GB 【中文平輸】    2880
#> 4:     週末超值降▼送保護貼+保護套SONY Xperia Z5 5.2吋美型防水旗艦機 16490
#> 5:   ▼送Kitty側掀皮套or32G卡+玻璃貼SONY Xperia Z5 Premium玫瑰石英粉 19999
#> 6:                   ▼送32G卡+手機立架+玻璃貼SONY Xperia Z5 Premium 19900
#>    author brand publishDate isPick
#> 1:                               0
#> 2:                               0
#> 3:                               0
#> 4:                               0
#> 5:                               0
#> 6:                               0
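rbindlist() can also cope with records that do not all have the same fields. A small sketch with made-up records, assuming data.table is installed:

```r
library(data.table)

# Hypothetical records; the second one has no "brand" field
records <- list(
  list(Id = "A-001", price = 100, brand = "x"),
  list(Id = "B-002", price = 250)
)

# fill = TRUE matches columns by name and pads missing ones with NA
dt <- rbindlist(records, fill = TRUE)
dt$brand
#> [1] "x" NA
```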

Exercise: UV API (JSON)

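The Environmental Resource Open Data Platform of the Environmental Protection Administration, Executive Yuan (行政院環境保護署) provides a set of RESTful APIs for public use. Try to fetch the real-time UV monitoring data and convert it into a data.frame.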

Answer

Real-time UV monitoring data

Unstructured data

Parsing unstructured data: Regular Expression

Package stringr

library(stringr)
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
#> [1] "-ne apple"     "tw- pears"     "thr-e bananas"
str_replace_all(fruits, "[aeiou]", "-")
#> [1] "-n- -ppl-"     "tw- p--rs"     "thr-- b-n-n-s"
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "[a-z]+")
#> [1] "apples" "bag"    "bag"    "milk"
str_extract_all(shopping_list, "[a-z]+")
#> [[1]]
#> [1] "apples" "x"     
#> 
#> [[2]]
#> [1] "bag"   "of"    "flour"
#> 
#> [[3]]
#> [1] "bag"   "of"    "sugar"
#> 
#> [[4]]
#> [1] "milk" "x"
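The same functions apply to text scraped from a page. As a sketch, the strings below are shaped like the push-ipdatetime text in the PTT response shown earlier; str_match() is used to pull the date and time apart with capture groups.

```r
library(stringr)

# Strings shaped like the push-ipdatetime text of a PTT post
push_time <- c(" 05/15 21:07\n", " 05/15 22:38\n", " 05/16 00:34\n")

# Whole timestamp: MM/DD HH:MM
str_extract(push_time, "\\d{2}/\\d{2} \\d{2}:\\d{2}")
#> [1] "05/15 21:07" "05/15 22:38" "05/16 00:34"

# Capture groups: column 1 is the full match, columns 2 and 3 the groups
parts <- str_match(push_time, "(\\d{2}/\\d{2}) (\\d{2}:\\d{2})")
parts[, 2]
#> [1] "05/15" "05/15" "05/16"
```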

Exercise: ETwarm



Save your data


Working Environment

Check your working directory every time you start to work!

## Not run
getwd() # show the current working directory
setwd("/xxx/xxxxx") # set working directory to "/xxx/xxxxx"
setwd("..")  # move up one directory
setwd("./xxxx")  # path relative to the current directory

write.csv() / data.table::fwrite()

library(jsonlite)
library(httr)
library(magrittr)  # provides %>%
url = "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=sony&page=1&sort=rnk/dc"
res_df = GET(url) %>% 
    content(as = "text", encoding = "UTF-8") %>%  # the response is piped in as the first argument
    fromJSON() %>% 
    .$prods     # equivalent to (function(x) {x$prods})
write.csv(res_df, "pchome.csv", row.names = FALSE)

library(data.table) # >= 1.9.7
fwrite(res_df, "pchome.csv")  # very fast!
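A quick way to check what was written is to read the file back. This is a sketch with a small made-up data frame and a temporary file, so it does not touch pchome.csv.

```r
# Hypothetical data standing in for res_df
demo_df <- data.frame(Id = c("A-001", "B-002"),
                      price = c(100, 250),
                      stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids an extra unnamed index column in the file
write.csv(demo_df, path, row.names = FALSE)

# Read the file back and compare with the original
back <- read.csv(path, stringsAsFactors = FALSE)
names(back)
#> [1] "Id"    "price"
```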

download.file()

download.file() downloads a file from the Internet. Depending on the method it uses, it may rely on external utilities such as curl or wget, and it can fail if the required utility is not installed on your system.

dest_dir = "resources/data/download"
dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

# Download whole HTML file
download.file("https://www.r-project.org/", 
              file.path(dest_dir, "r-project.org.html"))

# Download image (mode = "wb" keeps binary files intact on Windows)
download.file("https://www.r-project.org/Rlogo.png",
              file.path(dest_dir, "Rlogo.png"), mode = "wb")

list.files(dest_dir)

writeBin()

writeBin() writes binary data to your local disk.

dest_dir = "resources/data/download"

r = GET("http://opendata.epa.gov.tw/webapi/api/rest/datastore/355000000I-000004/?format=json")
# Use as = "raw" so the bytes are saved unchanged, with no character-encoding conversion
bin = content(r, as = "raw")
writeBin(bin, file.path(dest_dir, "uv.json"))
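The write can be verified offline by reading the bytes back with readBin(). In this sketch a short JSON string stands in for content(r, as = "raw"), and a temporary file is used instead of the download folder.

```r
# Stand-in for content(r, as = "raw"): raw bytes of a small JSON string
payload <- charToRaw('{"uv": 5}')

path <- tempfile(fileext = ".json")
writeBin(payload, path)

# Read the same number of bytes back and compare
payload_back <- readBin(path, what = "raw", n = file.size(path))
identical(payload, payload_back)
#> [1] TRUE
```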

Database

Case study

References

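Week 2 post-class survey

Congratulations on completing the second week of the course. By now you should be fairly comfortable using R to imitate browser behavior, connect to a site, fetch data, and parse it. If the material still feels unfamiliar, use your free time during the week to review; the course TAs and mentors are glad to help with your questions.

You can also describe your problem and post it on the forums. (Forum tutorial)

Our goal is to support your learning from more angles, so that you can reach your learning goals and keep learning and growing. Your feedback reaches us right away and helps us design and adjust a better course and service.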

rstudio

http://goo.gl/forms/pxKPMz6uiidkSsH52